Chapter 38: Statistical methods for corpus exploitation
… vs. concrete nouns). As an example of a classification into multiple categories, word tokens might be classified into syntactic categories such as noun, verb, adjective, adverb, etc., with an “other” class for minor syntactic categories and problematic tokens. A chi-squared test might then be performed to compare the frequencies of these categories in samples from two genres (a code sketch of such a test is given below). As another example, one might classify sentences according to the semantic class of their subject, and then compare the frequency of these semantic classes in samples of the populations of sentences headed by true intransitive vs. unaccusative verbs. It is not always obvious which characteristics should be operationalized as a classification of the tokens into types, and which should rather be operationalized in terms of different populations the tokens belong to. In some cases, it might make more sense to frame the task we just discussed in terms of the distribution of verb types across populations of sentences with different kinds of subjects, rather than vice versa. This decision, again, will depend on the linguistic question we want to answer.

In corpus linguistics, lexical classifications also play an important role. In this case, the types are the distinct word forms or lemmas found in a corpus (or sequences of word forms or lemmas). Lexical classifications may lead to extremely small proportions π (sometimes measured in occurrences per million words) and to huge differences between populations in the two-sample setting. Article 57 discusses some of the relevant methodologies in the context of collocation extraction.

The examples we just discussed give an idea of the range of linguistic problems that can be studied using the simple methods based on count data described in this Article. Other problems (or the same problems viewed from a different angle) might require other techniques, such as those mentioned in the next two Sections. For example, our study of passives could proceed with a logistic regression (see Section 8; a sketch is also given below), where we look at which factors have a significant effect on whether a sentence is in the passive voice or not. In any case, it will be fundamental for linguists interested in statistical methods to frame their questions in terms of populations, samples, types and tokens.

7 Non-randomness and the unit of sampling

So far, we have always made the (often tacit) assumption that the observed data (i.e., the corpus) are a random sample of tokens of the relevant kind (e.g., in our running example of passives, sentences) from the population. Most obviously, we compared a corpus study to drawing balls from an urn in Section 2, which allowed us to predict the sampling distribution of observed frequencies. However, a realistic corpus will rarely be built by sampling individual tokens; rather, it is assembled from contiguous stretches of text or even entire documents (such as books, newspaper editions, etc.). For example, the Brown corpus consists of 2,000-word excerpts from 500 different books (we will refer to these excerpts as “texts” in the following). The discrepancy between the unit of measurement (a token) and the unit of sampling (which will often contain hundreds or thousands of tokens) is particularly obvious for lexical phenomena, where tokens correspond to single words. Imagine the cost of building the Brown corpus by sampling a single word each from a million different books rather than 2,000 words each from only 500 different books!
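To make the genre comparison described above concrete, here is a minimal sketch of such a chi-squared test. It is not taken from the chapter: the category counts are invented purely for illustration, and Python with the scipy library is an assumed toolchain.

    # Minimal sketch: chi-squared test comparing the frequencies of syntactic
    # categories (noun, verb, adjective, adverb, other) in samples from two genres.
    # All counts are invented for illustration.
    from scipy.stats import chi2_contingency

    category_counts = [
        [1180, 720, 310, 150, 640],  # hypothetical counts from genre A
        [1020, 810, 280, 190, 700],  # hypothetical counts from genre B
    ]

    chi2, p_value, dof, expected = chi2_contingency(category_counts)
    print(f"chi-squared = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")

A significant result indicates that the category distribution differs between the two samples; whether that difference generalizes to the underlying populations depends, among other things, on the sampling issues discussed in this Section.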
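Similarly, the logistic-regression study of passives suggested above might look roughly as follows. This is only a sketch under assumed data: the toy data frame, the predictors sentence_length and genre, and the use of the statsmodels package are illustrative choices, not the chapter's own analysis.

    # Minimal sketch: logistic regression asking which factors affect whether
    # a sentence is in the passive voice.  The data and predictor names are
    # hypothetical; a real study would use the full annotated sample.
    import pandas as pd
    import statsmodels.formula.api as smf

    sentences = pd.DataFrame({
        "passive":         [1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0],   # 1 = passive voice
        "sentence_length": [23, 18, 9, 12, 25, 27, 14, 8, 21, 10, 16, 30],
        "genre":           ["acad", "news", "acad", "news", "acad", "acad",
                            "news", "news", "acad", "news", "acad", "news"],
    })

    model = smf.logit("passive ~ sentence_length + C(genre)", data=sentences).fit()
    print(model.summary())   # coefficients, standard errors and p-values

Each coefficient estimates the effect of one factor on the log-odds of the passive, which is exactly the “which factors have a significant effect” question framed above. Whether either analysis licenses conclusions about the population depends on how the corpus was sampled, the topic of this Section.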
Even in our example, where each token corresponds to an entire sentence, the unit of sampling is much larger than the unit of measurement: each text in the Brown corpus contains roughly between 50 and 200 sentences. This need not be a problem for the statistical analysis, as long as each text is itself a random sample of tokens from the population, or at least sufficiently similar to one. However, various factors, such as the personal style of …